massachusetts  institute  of  technology  —  artificial  intelligence  laboratory 


Direction  Estimation  of 
Pedestrian  from  Images 


Hiroaki  Shimizu  and  Tomaso  Poggio 


AI  Memo  2003-020  August  2003

CBCL  Memo  230 


©2003  massachusetts  institute  of  technology,  Cambridge,  ma  02139  usa  —  www.ai.mit.edu 


Report  Documentation  Page 


1. REPORT DATE: AUG 2003

3. DATES COVERED: 00-08-2003 to 00-08-2003

4. TITLE AND SUBTITLE: Direction Estimation of Pedestrian from Images

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Massachusetts Institute of
Technology, Artificial Intelligence Laboratory, 77 Massachusetts Avenue,
Cambridge, MA, 02139

12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release;
distribution unlimited

13. SUPPLEMENTARY NOTES: The original document contains color images.

16. SECURITY CLASSIFICATION OF: a. REPORT unclassified; b. ABSTRACT
unclassified; c. THIS PAGE unclassified

18. NUMBER OF PAGES: 12

Standard Form 298 (Rev. 8-98), prescribed by ANSI Std Z39-18


Abstract 


The  capability  of  estimating  the  walking  direction  of  people  would  be  useful 
in  many  applications  such  as  those  involving  autonomous  cars  and  robots. 

We  introduce  an  approach  for  estimating  the  walking  direction  of  people 
from  images,  based  on  learning  the  correct  classification  of  a  still  image  by 
using  SVMs.  We  find  that  the  performance  of  the  system  can  be  improved  by 
classifying  each  image  of  a  walking  sequence  and  combining  the  outputs  of  the 
classifier. 

Experiments were performed to evaluate our system and to estimate the
trade-off between the number of images in a walking sequence and performance.
1  Introduction


In recent years many applications for automatically detecting visual objects
such as obstacles, people and faces have been introduced. There have, however,
been only a few attempts to estimate the walking direction of people. In this
report we describe an approach to the problem.

We consider the problem to be similar to estimating the pose of a human body,
of a face or of a hand. In all these problems, there are two basic kinds of
approaches: model-based and learning-based.

A model-based approach attempts to recover a pose by analyzing input images
and comparing them to available models. One of the most popular model-based
approaches is to construct 2D ellipsoid or stick models, which are then used
in a comparison driven by features obtained from input images [1][2][3]. The
deformable surface in XYT space is used as a feature for analyzing gait [1]. In
the work by Guo et al. [2], the skeleton of the silhouette of a walking human is
obtained and then compared to a 2D stick model. In Chang [4], ribbons
corresponding to arms and legs are used for analyzing gait. A statistical
description of blobs is used in the people detection system developed by Wren
et al. [5]. This 2D model-based approach usually requires the segmentation of
the body parts of a human from the background; it also requires sequences of
images in order to track the parts of the human body.

Another popular model-based approach is to use an accurate 3D model with
information about the kinematic and shape properties of the human body
[6][7]. This approach is usually quite difficult since it requires an accurate
prior model.

Learning-based approaches directly estimate the parameters of the pose of
the human body. In these approaches, it is not always necessary to segment an
explicit shape of body parts. In many cases, low-level 2D features such as shape,
motion, color and position of the points of interest are used by learning-based
classifiers.

In the work by Freeman [8], the x-y image moments and the orientation
histogram of the shape are used. Low-level optical flow induced by the motion
of humans can also be used [9]. Deformable shape models are applied to the
tracking of pedestrians in the work by Baumberg [10]. Image pixels are sometimes
used as input directly: Darrell et al. [11] use image pixels directly for pose
estimation of hands. Quite a few papers deal with faces. Local models obtained
from a large database of examples are used for estimating the pose of a human
upper body [20]. Kumar [17] uses a linear morphable model to estimate the
opening of the mouth directly from the image. In the work by Heisele et al.
[19], a component-based method is used to detect faces in still images; the
parameters of the rotation can be estimated by using the geometry of the face
components. The work by Oren [12] uses wavelet coefficients as low-level
features and applies a Support Vector Machine classifier to them to detect
pedestrians. Other classification methods such as decision trees [13] and
nearest neighbors [9] are also popular.
Figure 1: Overview of our direction estimation method by a single image
(from input image to estimated direction)


The approach described here starts from a single image for direction
estimation and allows any background. We choose a learning-based method since
the model-based methods require automatic segmentation of body parts for pose
recovery. We choose a regularization technique such as Support Vector Machines
because it has been used successfully in many computer vision applications and
is well founded in statistical learning theory [14]. We use frame sequences
only for improving the direction estimation. In this case we apply the same
technique to each image and decide the final direction by majority vote among
the classifications of the images in the sequence. In this project we do not
consider the detection of people in images and assume that they have already
been detected [15].

2  System  Overview 

We describe the algorithm for estimating walking directions (see Fig. 1). We
use Haar wavelets to generate feature vectors from the input images and train
16 individual classifiers, each corresponding to one walking direction.
Before training, we separate the training data into two groups: one consisting
of the 8 directions $45.0 \times i$ ($i = 0, \ldots, 7$) and the other
consisting of the remaining 8 directions $45.0 \times i + 22.5$
($i = 0, \ldots, 7$). Each individual classifier is trained on one direction.
At run time, each of the trained classifiers produces a real-valued output.
The system chooses the most likely direction with a decision function based on
the output of the classifier for that direction and the outputs of the two
classifiers corresponding to the neighboring directions.

Figure 2: Overview of our method by multiple images

In order to estimate directions more accurately we apply this technique to
each image of a walking sequence and combine the individual classifications
(see Fig. 2). We explain the details in the following sections.

3  Feature  extraction 

Haar wavelet coefficients (8x8 pixels) are used to generate feature vectors for
each image. The wavelets form an overcomplete set at each scale since they
overlap 75 percent with the neighboring wavelets in the vertical and horizontal
directions [16][17]. We use three different orientations of Haar wavelets
(i.e. horizontal, vertical and diagonal). This method results in a thorough and
compact representation of the input images (see Fig. 3).
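
The feature extraction step can be illustrated with a minimal sketch. It is not the implementation used in our experiments: it assumes that "8x8 pixels" is the support of each wavelet, that a 75 percent overlap corresponds to a shift of 2 pixels, that only one scale is used, and that the image is a plain list of rows of gray values. The function names are hypothetical.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_features(img, size=8, overlap=0.75):
    """Three Haar responses per window on a dense, overlapping grid."""
    stride = max(1, round(size * (1 - overlap)))  # 8 px at 75% overlap: 2 px
    half = size // 2
    ii = integral_image(img)
    h, w = len(img), len(img[0])
    feats = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            left = box_sum(ii, x, y, half, size)
            right = box_sum(ii, x + half, y, half, size)
            top = box_sum(ii, x, y, size, half)
            bottom = box_sum(ii, x, y + half, size, half)
            tl = box_sum(ii, x, y, half, half)
            tr = box_sum(ii, x + half, y, half, half)
            bl = box_sum(ii, x, y + half, half, half)
            br = box_sum(ii, x + half, y + half, half, half)
            feats.append(right - left)            # responds to vertical edges
            feats.append(bottom - top)            # responds to horizontal edges
            feats.append((tl + br) - (tr + bl))   # responds to diagonal structure
    return feats
```

Under these assumptions, an image 95 pixels wide and 151 pixels high yields 44 x 72 window positions and therefore 9504 coefficients. This count follows from the sketch only and is not a figure reported for our system.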


4  Classification 

We use Support Vector Machines to classify the feature vectors resulting
from the Haar wavelet representation. The choice of the kernel function usually
plays an important role in the overall performance of SVM-based classification.
Based on the results of our experiments, we chose a linear kernel function for
our system.
Figure 3: Samples of wavelet coefficients


In our approach we decided to classify the walking direction into one of 16
directions (i.e. 0, 22.5, 45.0, ..., 315.0, 337.5; that is, every 22.5 degrees).
To achieve our goal we trained 16 individual classifiers, each corresponding to
one of the directions. When the system attempts to classify an image with a
walking direction which does not correspond to any of the 16 trained
directions, the classifier closest to the unknown direction is expected to
produce the largest output of all the classifiers. If this assumption holds, we
can assign any new direction to one of the trained directions.

We separate the training data into two groups: the 8 directions 0, 45,
90, ..., 315, and the other 8 directions 22.5, 67.5, 112.5, ..., 337.5. Each
classifier is trained only on the group that contains its direction. We chose
this approach since it gives better estimation results than the alternative
method of training each of the 16 classifiers on a single group of
16 directions (see Fig. 5).
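
The grouping can be sketched as follows. This is an illustration only, assuming that each training sample is a (direction, image) pair; the helper names `group_of` and `training_labels` are hypothetical.

```python
DIRECTIONS = [22.5 * i for i in range(16)]  # 0.0, 22.5, ..., 337.5


def group_of(direction):
    """Return the 8 directions in the same 45-degree group as `direction`."""
    offset = direction % 45.0  # 0.0 for one group, 22.5 for the other
    return [45.0 * i + offset for i in range(8)]


def training_labels(target, samples):
    """Label (direction, image) samples for the classifier of `target`.

    Samples from the other group are skipped. Within the group, the target
    direction is positive (+1) and the other 7 directions are negative (-1).
    """
    group = set(group_of(target))
    return [(image, 1 if d == target else -1)
            for d, image in samples if d in group]
```

For the 45 degrees classifier, for example, `group_of(45.0)` contains 0, 45, ..., 315, so images of 22.5 or 67.5 degrees are never used as negative examples for it.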

At run time, the 16 classifiers of the system produce 16 outputs, one per
classifier. The following decision function, based on these outputs, is used
to decide the estimated direction:

•  let $i$ index the $i$th target direction, corresponding to one of the 16
directions,

•  let $s_i$ be the output of the $i$th classifier,

•  and let $N_i$ be the set of the closest neighboring directions of the $i$th
target direction.

•  For each of the 16 directions, the decision function $f(s_i)$ is defined as
follows:

$$f(s_i) = w \times s_i + \sum_{j \in N_i} s_j,$$

where $w$ is the weight of the target direction.

•  We pick the direction $k$ corresponding to the highest value of the
decision function:

$$k = \arg\max_{i} f(s_i), \quad i = 1, 2, \ldots, 16$$

In the case of evaluating the 45 degrees direction, we use the outputs of the
classifier for 45 degrees as a target direction and those for 22.5 and 67.5
degrees as neighboring directions:

$$f(s_{45}) = \underbrace{w \times s_{45}}_{\text{target}} + \underbrace{(s_{22.5} + s_{67.5})}_{\text{neighbors}}$$
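
A minimal sketch of this decision rule is given below. It assumes that the 16 classifier outputs are stored in order of increasing angle, so that the neighbors of index i are i - 1 and i + 1 on the circle. The default weight `w=2.0` is a placeholder chosen for illustration, not the value used in our system.

```python
def decide(scores, w=2.0):
    """Return the index k maximizing f(s_i) = w * s_i + s_(i-1) + s_(i+1).

    `scores` holds one real-valued classifier output per direction, ordered
    by angle. Neighbors wrap around, so 0 and 337.5 degrees are adjacent.
    """
    n = len(scores)
    f = [w * scores[i] + scores[(i - 1) % n] + scores[(i + 1) % n]
         for i in range(n)]
    return max(range(n), key=lambda i: f[i])
```

With this rule, an isolated high output whose neighbors both respond negatively can lose to a slightly lower output that is supported by its neighbors.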

According to our experiments (see Fig. 6), more than 90% of the testing data
were classified as either the correct direction or one of its two closest
neighboring directions. This result suggests that we may achieve more accurate
results by using multiple images in a walking sequence. To estimate the
direction from a sequence of images, we apply the above procedure to each image
in the sequence and decide the final direction by choosing the most frequent
direction (see Fig. 2). When two or more directions have the same frequency, we
calculate the sum of all output scores for each of these directions and choose
the direction corresponding to the greatest sum.
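
The combination over a sequence can be sketched as follows. The sketch assumes that each frame contributes a (direction, score) pair, where the score is the decision value of the direction chosen for that frame, and that the tie-break sums these scores over the frames that voted for each tied direction. This is one reading of the rule above; the function name `vote` is hypothetical.

```python
from collections import Counter


def vote(frame_results):
    """Combine per-frame results given as (direction, score) pairs.

    The most frequent direction wins. Ties are broken by the larger sum of
    scores among the tied directions.
    """
    counts = Counter(d for d, _ in frame_results)
    best = max(counts.values())
    tied = [d for d, c in counts.items() if c == best]
    totals = {d: sum(s for dd, s in frame_results if dd == d) for d in tied}
    return max(tied, key=lambda d: totals[d])
```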

5  Experiments 

The training examples were obtained from pictures of walking people taken
under different lighting conditions and in different places. The height of the
people in all training images was normalized to the same size. The size of each
training image was 95x151 pixels. As described in Section 4, we separated the
training images of the 16 directions into two groups. Within each group, every
classifier was trained with 1000 positive and 7000 negative examples. The
positive examples are images of the direction corresponding to the classifier,
and the negative examples are images of the other 7 directions in the same
group. For instance, the classifier for 45 degrees was trained with the images
of 45 degrees as positive examples and with those of the other 7 directions
(i.e. 0, 90, 135, 180, ..., 315) as negative examples. The trained classifiers
were run over 2400 testing images (150 images for each direction). As shown in
Fig. 4, one walking cycle consisted of 5 to 6 images. There is no overlap
between the testing and training images.

Figure 4: Two sample sequences of walking images

We evaluate the recognition rates of our system as a function of the number
of frames (between 1 and 10) in the walking sequences. Thus we tested 0 to
2 cycles of the walking sequences. The results of our experiments are shown
in Fig. 7 and Fig. 8. In Fig. 7, we can see that 5-6 frames, which correspond to
about 1 cycle of the walking sequence, are necessary and sufficient: increasing
the number of frames beyond 6 does not improve the estimate of the direction.
If accuracy is measured in terms of the correct direction and its two
neighboring ones, performance with 5-6 frames is about the same as with a
single frame, which is not surprising since the latter is already quite high.
Performance in this case seems to improve with 10 frames (see Fig. 8).
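
The two accuracy measures used above can be computed as in the following sketch. It assumes that directions are represented as indices 0 to 15 in order of increasing angle; the function names are hypothetical.

```python
def bin_distance(a, b, n=16):
    """Circular distance between two direction indices, in bins of 22.5 deg."""
    d = abs(a - b) % n
    return min(d, n - d)


def recognition_rates(predicted, actual, n=16):
    """Return (rate for the exact direction, rate within one neighbor)."""
    dists = [bin_distance(p, t, n) for p, t in zip(predicted, actual)]
    exact = sum(d == 0 for d in dists) / len(dists)
    near = sum(d <= 1 for d in dists) / len(dists)
    return exact, near
```

The circular distance matters at the wrap-around: a prediction of 337.5 degrees for a true direction of 0 degrees counts as a neighboring direction, not as an error of 15 bins.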
Figure 5: Recognition rates of two kinds of classification methods

Figure 6: Recognition rates of a target only and a target + closest neighbors


Figure  7:  Recognition  rates  of  the  estimation  of  target  directions  by  multiple 
images 


Figure  8:  Recognition  rates  of  the  estimation  of  target  and  neighboring  direction 
by  multiple  images 
6  Conclusion


In this paper, we presented a method for estimating the walking direction of
a person from a single image. We extended this method to image sequences
by applying the same technique to each frame and combining the classification
results. Our approach is capable of handling variations in lighting and image
background; it is capable of estimating walking direction even when only a
single image is available. This may be an advantage in cases in which the
system fails to track the pedestrians in a video for several frames.

We found the interesting result that using one cycle of a walking sequence
improves direction estimation, whereas longer sequences do not help.

As shown in Fig. 5, our approach can classify more than 90% of test images
into the correct direction or one of its neighboring directions from a single
image.